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Gaussian Processes and Limiting Linear Models 

Robert B. Gramacy and Herbert K. H. Lee* 

Abstract 

Gaussian processes retain the linear model either as a special case, or in the limit. 
We show how this relationship can be exploited when the data are at least partially 
linear. However from the perspective of the Bayesian posterior, the Gaussian processes 
which encode the linear model either have probability of nearly zero or are otherwise 
unattainable without the explicit construction of a prior with the limiting linear model 
in mind. We develop such a prior, and show that its practical benefits extend well 
beyond the computational and conceptual simplicity of the linear model. For example, 
linearity can be extracted on a per-dimension basis, or can be combined with treed 
partition models to yield a highly efficient nonstationary model. Our approach is 
demonstrated on synthetic and real datasets of varying linearity and dimensionality. 

Key words: Gaussian process, nonstationary spatial model, semiparametric regres- 
sion, partition modeling 

1 Introduction 



The Gaussian Process (GP) is a comm on model for fitting arbitrary functions or surfaces 



because of its nonparametric flexibility (jCressie 



199ll ). Such models and their extensions are 
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used i n a wide variety of ap plications, such as computer 



2001 



Santner et al. 



20071 ). and geology (j Chiles and Delfiner 



2003h. environmental m odeling (iGilleland and Nychka 



experiments (IKennedv and O'Haean 



2005 



Caldei 



1999I ). The modeling flexibility of GPs comes with 



large computational requirements and complexities. Sometimes a GP will produce a very 
smooth flt, i.e., one that appears lineaiij. In these cases, there is a lot of computational 
overkill in fltting a GP, when a linear model (LM) will flt just as well. LMs, which can be 
seen as a limiting case of the GP, also avoid numerous issues to do with numerical instability. 
It may therefore be desirable to be able to choose adaptively between a LM and a GP. The 
goal of this paper is to link GPs with standard linear models. One major beneflt is the 
retention of modeling flexibility while greatly improving the cor nputational situation. W e 



20081), 



can make further gains by combining this union with treed GPs (iGramacy and Lee 
and we demonstrate results for both treed GPs as well as for stationary GP models. 

The remainder of the paper is organized as follows. Section [2] reviews the Gaussian 
process (GP) and the treed GP. Section [3] introduces the concept of the limiting linear model 
(LLM) of a GP. Therein we argue that, without further intervention, none of the limiting 
parameterizations, which are at the extremes of the parameter space, lead to a feasible model 
selection prior. However, a more thorough exploratory analysis reveals that there is a broad 
continuum of GP parameterizations that behave like the LM and, importantly, tend to have 
high posterior support when the data is indeed linear. It is these parameterizations which 
we use to motivate our prior, given in Section |U in order to seek out a more parsimonious 
model. We propose a latent variable formulation that leads to a prior over a family of semi- 
parametric models lying between the GP and its LLM that can be used to investigate the 
nature of the influence (in terms of linear versus non-linear) of predictors on the response. 
Section [5] covers some details pertaining to efficient implementation, and gives the results of 



"'^We use "smooth" in this colloquial sense throughout to mean nearly linear, or the opposite of "wavy" 
as opposed to the more technical "infinitely differcntiable" sense. 



extensive experimentation on real and synthetic data, as well as a comparison of our methods 
to other modern approaches to regression. Section [6] concludes with a brief discussion. 

2 Gaussian Processes and Treed Gaussian Processes 

Consider the following Bayesian hierarchical model for a GP with a linear mean for n inputs 
X of dimension mx-, and n responses y: 

y|/3, a\ K ~ iV„(F/3, a^K) a^ ~ IG{aj2, qj2) 

(3\a\ t\ W ~ iV„(/3o, aV^W) r^ ~ /G(a,/2, g,/2) (1) 

/3o ~ iV^(/^, B) W-i ~ W^((pV)-i, p) 

where F = (1, X) has m = mx + 1 columns. A^, IG and W are the Normal, Inverse-Gamma 
and Wishart distributions, respectively. Constants ^t, B, V,p, ao-, O'er, Or, 5't are treated as 
known. The matrix K is constructed from a function K{-,-) of the form ii'(xj,Xfc) = 
ii'*(xj,Xfc) + gSj^k where 6.^. is the Kronecker delta function, g is called the nnf/^fei parameter 
and is included in order to interject measurement error (or random noise) into the stochastic 
process, and K* is a true correlation which we take to be from the separable power family 
with Pi = 2: 



K*i^„ x.|d) = exp I - 5^ i^^,_^^ 1 _ (2) 

Generalizations are straightforward, e.g., see Section [6l The specification of priors for K, 
K*, and their parameters d and g will be deferred until later, as their construction will be 
a central part of this paper. The separable power family allows some input variables to be 
modeled as more highly correlated than others. The isotropic power family is a special case 
(when d = di and p = pi, ior i = 1, . . . , mx)- 

Posterior inference and estimation is straightforward using the Metropolis-Hastings (MH) 



and Gibbs sampling algorithms (JGramacy and Led . l2008l ) . It can be shown that the regres- 
sion coefficients have full conditionals /3|rest ~ 7Vm(/3, cr^Vs), and /3o|rest ~ A^m(/3o) ^/3o)' 
where 



V^ = (F^K-^F + W-V^^)" 



V, = B-Vw-^EtiK 



-1 



3 = V^(F^K-V+W-i/3o/r2), 
3o = V^^,(b-V+W-iE1i/3M 



(3) 



Analytically integrating out f3 and a^ gives a marginal posterior for K (JBerger et al.l . 1200 ll ) 
that can be used to obtain efficient MH draws. 



p(K|y,/3o,W,r2 



where 



IKIIWI 



^2m 



"<^/2 -p ri/ 



v^|(27r)-"y (f)"^^^r[iK 



n 






■p(K) 



The predicted value of y at x is normally distributed with mean and variance 



(4) 



y(x)=f' (x)/3+k(x)'K-^(y-F/3), 



a(x)^ = a^[«:(x,x)-q'(x)C-^q(x)], (5) 



where f3 is the posterior mean estimate of /3, C ^ = (K + r^FWF''') ^, q(x) = k(x) + 
r2FWf(x), K(x,y) = i^(x,y) + r2fT(x)Wf(y), fT(x) = (1,xT), and k(x) is a n-vector 
with k(x)j = -ft^f x, Xj), for all x, € X, the training data. 



A treed GP ( iGramacy am 



d Led. 



and Regression Tree) model (jChipman et al. 



20081) is ageneralization of the CART (Classification 



19981 ) that uses GPs at the leave s of th e tree 



Chipman et al. 



(12002). The 



in place of the usual constant values or the linear regressions oi[ 
Bayesian interpretation requires a prior be placed on the tree and GP parameterizations. 
Sampling commences with Reversible Jump (RJ) MCMC which allows for a simultaneous 
fit of the tree and the GPs at its leaves. The predictive surface can be discontinuous across 



the partition boundaries of a particular tree T. However, in the aggregate of samples col- 
lected from the joint posterior distribution of {T,0}, the mean tends to smooth out near 
likely partition bound aries as the RJ-MCMC in tegrates over trees and GPs according to the 



posterior distribution fJGramacy and Lee 



20081) 



The treed GP approach yields an extremely fast implementation of nonstationary GPs, 
providing a divide-and-conquer approach to spatial modeling. Software implementing the 
treed GP model and all of its special cases (e.g., stationary GP, CART & the treed linear 
model, linear model, etc.), including the extens ions discussed in this paper, is available as 



an R package (JR Development Core Team 



20041 ). and can be obtained from GRAN: 



http : //www . cran . r-pro j ect . org/web/packages/tgp/index . html. 

The package implements a family of default prior specifications for the known constants in 
Eq. ([1]), and those described in the following sections, which are used throughout unless 



otherwise no ted. For more d etails see the tgp documentation (JGramacy and Taddy 



and tutorial (JGramacy 



20071 ). 



20081) 



3 Limiting Linear Models 

A special limiting case of the GP model is the standard linear model (LM). Replacing the 
top (likelihood) line in the hierarchical model given in ([1]) 



y\(3,a\Kr^N^{F(3,a'K) 



with 



y\(3,a^r^N^{F(3,aH), 



where I is the n x n identity matrix, gives a parameterization of a LM. From a phenomeno- 
logical perspective, GP regression is more flexible than standard linear regression in that it 
can capture nonlinearities in the interaction between covariates (x) and responses (y) . From 
a modeling perspective, the GP can be more than just overkill for linear data. Parsimony 



and over-fitting considerations are just the tip of tlie iceberg. It is also unnecessarily compu- 
tationally expensive, as well as numerically unstable. Specifically, it requires the inversion 
of a large covariance matrix — an operation whose computing cost grows with the cube of 
the sample size, n. Moreover, large finite d parameters can be problematic from a numerical 
perspective because, unless g is also large, the resulting covariance matrix can be numerically 
singular when the off-diagonal elements of K are nearly one. 

It is common practice to scale the inputs (x) either to lie in the unit cube, or to have 
a mean of zero and a range of one. Scaled data and mostly linear predictive surfaces can 
result in almost singular covariance matrices even when the range parameter is relatively 
small [2 < di <^ oo). So for some parameterizations, the GP is operationally equivalent 
to the limiting linear model (LLM), but comes with none of its benefits (e.g. speed and 
stability). As this paper demonstrates, exploiting and/or manipulating such equivalence can 
be of great practical benefit. As Bayesians, this means constructing a prior distribution 
on K that makes it clear in which situations each model is preferred (i.e., when should 
K — > ell). Our key idea is to specify a prior on a "jumping" criterion between the GP and 
its LLM by taking advantage of a latent variable formulation, thus setting up a Bayesian 
model selection/averaging framework. 

Theoretically, there are only two parameterizations of a GP correlation structure {K) 
which encode the LLM. Though they are indeed well-known, without intervention they are 
quite unhelpful from the perspective of practical estimation and inference. The first one is 
when the range parameter (d) is set to zero. In this case K = (1 + (7)1, and the result is 
cle arly a li n ear rn odel. The other parameterization may be less obvious. 



Cressid ( 1991 



Section 3.2.1) analyzes the "effect of variogram parameters on kriging" 
paying special attention to the nugget {g) and its interaction with the range parameter. He 
remarks that the larger the nugget the more the kriging interpolator smoothes and in the 
limit predicts with the linear mean. Perhaps more relevant to the forthcoming discussion is 



his later remarks on the interplay between the range and nugget parameters in determining 
the kriging neighborhood. Specifically, a large nugget coupled with a large range drives the 
interpolator towards the linear mean. This is refreshing since constructing a prior for the 
LLM by exploiting the former GP parameterization (range d ^ 0) is difficult, and for 
the latter (nugget g -^ oo) near impossible. We regard these parameterizations, which are 
situated at the extremes of the parameter space, as a dead-end as far as serving as the basis 
for a model-selection prior. Fortunately, Cressie's thoughts on the kriging neighborhood 
reveal that an (essentially) linear model may be attainable with nonzero d and finite g. 

3.1 Exploratory analysis 

Here we shall conduct an exploratory analysis to study the kriging neighborhood and look for 
a platform from which to "jump" to the LLM. The analysis will focus on studying likelihoods 
and posteriors for GPs fit to data generated from the linear model 

yi = l + 2x, + e, where e^ ~ iV(0, 1) (6) 

using n = 10 evenly spaced x- values in the range [0, 1]. 

3.1.1 GP likelihoods on linear data 

Figure [1] shows two interesting samples from ([6]). Also plotted are the generating line (dot- 
dashed), the maximum likelihood (ML) linear model (/3) line (dashed), the predictive mean 
surface of the ML GP, maximized over range (i, i.e., the one-vector d, nugget gf, and [cr^|(i, g] 
(solid), and its 95% error bars (dotted). The ML values of d and g are also indicated in each 
plot. The GP likelihoods were evaluated for ML estimates of the regression coefficients /3. 



Conditioning on g and d, the ML variance was computed by solving 



. jL log ^„„|F/5, .^K) ^ -4 + (i^«^K-'(y - m 



da^ 



a" 



a 



2^2 



This gave an MLE with the form (5"^ = (y — F/3)^K ^(y — Ff3)/n. For the hnear model 
the likelihood was evaluated as -P(y) = A'^io(F/3, a^I), and for the GP as P{y\d,g) = 



N- 



10 



F/3, a^Kjd^} , where Kj^^} is the covariance matrix generated using K{-, ■) = K*{-, ■\d) + 



g6.^. for K*{-, -{d) from the power family with range parameter d. 



10 samples from y=1+2x+e, e~N(0,1) 



likelihood: Linear & GP 




10 samples from y=1+2x+e, e~N(0,1) 








likelihood: Linear &GP 
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Figure 1: Two simulations (rows) from ?/j = 1 + 2xj + ej, e^ ~ A^(0, 1). Left column shows 
GP fit (solid) with 95% error bars (dotted), maximum likelihood /3 (dashed), and generating 
linear model (/3 = (1,2)) (dot-dashed). Right column shows GP{d,g) likelihood surfaces, 
with each curve representing a different value of the nugget g. The (maximum) likelihood 
(/3) of the linear model is indicated by the solid horizontal line. 



Both samples and fits plotted in Figure [T] have linear looking predictive surfaces, but 
only for the one in the top row does the linear model have the maximum likelihood. Though 



the predictive surface in the bottom-left panel could be mistaken as "linear" , it was indeed 
generated from a GP with large range parameter {d = 2) and modest nugget setting (g) 
as this parameterization had higher likelihood than the linear model. The right column of 
Figure [1] shows likelihood surfaces corresponding to the samples in the left column. Each 
curve corresponds to a different setting of the nugget, g. Also shown is the likelihood value 
of the MLE (3 of the linear model (solid horizontal line). The likelihood surfaces for each 
sample look drastically different. In the top sample the LLM {d = 0) uniformly dominates all 
other GP parameterizations. Contrast this with the likelihood of the second sample. There, 
the resulting predictive surface looks linear, but the likelihood of the LLM is comparatively 
low. 



likelihood for increasing nugget (g) 



f i- 



Figure 2: Likelihoods as the nugget gets large for single sample of size n = 100 from Eq. ([6]). 
The X-axis is (log (7), the range is fixed at c? = 1; the likelihood of the LLM {d = 0) is shown 
for comparison. 

Figure [2] illustrates the other LLM parameterization by showing how, as the nugget g 
increases, the likelihood of the GP approaches that of the linear model for a sample of size 
ra = 100 from ([6]) with d fixed to one. Observe that the nugget must be quite large relative 
to the actual variability in the data before the likelihoods of the GP and LLM become 
comparable. 

Most simulations from (Q gave predictive surfaces like the upper left of Figure [H and 
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10 samples from y=1+2x+e, e~N{0,1) 



likelihood: Linear & GP 



d = 0.032064 
9 = 0.16667 



10samplesfromy=1+2x+e. e~N{0.1) 



likelihood: Linear & GP 



d = 0.024048 
9 = 




Figure 3: GP{d,g) fits {left) and likelihood surfaces {right) for two of samples of the LM ([6]). 

corresponding likelihoods like the upper-right. But this is not always the case. Occasionally 
a simulation would give high likelihood to GP parameterizations if the sample was slowly 
waving. This is not uncommon for small sample sizes such as ra = 10. For example, consider 
those shown in Figure O Waviness becomes less likely as the sample size n grows. 

Figure H] summarizes the ratio of the ML GP parameterization over the ML LM based 
on 1000 simulations of ten evenly spaced random draws from ([6]). A likelihood ratio of 
one means that the LM was best for a particular sample. The 90%-quantile histogram and 
summary statistics in Figure H] show that the GP is seldom much better than the LM. For 
some samples the ratio can be really large (> 9000) in favor of the GP, but more than 
two-thirds of the ratios are close to one — approximately 1/3 (362) were exactly one but 2/3 
(616) had ratios less than 1.5. This means that the posterior inference for borderline linear 
data is likely to depend heavily the prior specification of K{-, ■). 
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ratio: GP/Linear Likelihoods 





Summary: 




Min 


1.000 




1st Qu. 


1.000 




Median 


1.141 




Mean 


19.160 




3rd Qu. 


2.619 




Max 


9419.0 



maximum likelihood 



Figure 4: Histogram of the ratio of the maximuni hkehhood GP parameterization over the 
hkehhood of the hmiting hnear model with full summary statistics. For visual clarity, the 
upper tail of the histogram is not shown. 



For some of the smaller nugget values, in particular g = 0, and larger range settings d, the 
likelihoods for the GP could not be computed because the corresponding covariance matrices 
were numerically singular, and could not be decomposed. This illustrates a phenomenon 
noted by Neal (1997) who advocates that a non-zero nugget (or jitter) should be included 
in the model to increase numerical stability. Numerical instabilities may also be avoided by 
allowing pj < 2 in ([2]), by using the Matern family of correlation functions (see Section |6]), 
or by simply using an LM where appropriate. 

3.1.2 GP posteriors on linear data 

Suppose that rather than examining the multivariate normal likelihoods of the linear and 
GP model, using the ML /3 and a^ values, the marginalized posterior p(K|y,/3g, r^, W) of 
Eq. (j4]) was used, which integrates out f3 and o"^. Using (j4j) requires specification of the prior 
p(K), which for the power family means specifying p{d, g). Alternatively, one could consider 
dropping the p{d, g) term from (j3]) and look solely at the marginalized likelihood. However, 
in light of the arguments above, there is reason to believe that the prior specification will 



11 



carry significant weight. 

If it is suspected tliat tlie data miglit be linear, tliis bias sfiould be someliow encoded in 
tlie prior. This is a non-trivial task, given the nature of the GP parameterizations which 
encode the LLM. Pushing d towards zero is problematic because small non-zero d causes the 
predictive surface to be quite wiggly — certainly far from linear. Deciding how small the range 
parameter jd) should be before treating it as zero — as in Stoc hastic Search Varia ble Selection 



(SSVS) of 



Gilks et al. 



(|l996|)— while still 



George and McCullochl (119931 ). or Chapter 12 of 
allowing a GP to fit truly non-linear data is no simple task. The large nugget approach is 
also out of the question because putting increasing prior density on a parameter as it gets 
large is impossible. Rescaling the responses might work, but constructing the prior would 
be nontrivial. Moreover, such an approach would preclude its use in many applications, 
particularly for adaptive sampling or sequential design of experiments when one hopes to 
learn about the range of responses, and/or search for extrema. 

However, we have seen that for a continuum of large d values (say d > 0.5 on the unit 
interval) the predictive surface is practically linear. Consider a mixture of gammas prior for 
d: 



p{d, g) = p{d)p{g) = p{g) [G{d\a = 1, /? = 20) + G{d\a = 10, /5 = 10)]/2. 



(7) 



It gives roughly equal mass to small d (mean 1/20) representing a population of GP parame- 
terizations for wavy surfaces, and a separate population for those which are quite smooth or 
approximately linear. Figure [8] depicts p{d) via a histogram, ignoring p{g) which is usually 
taken to be a simple exponential distribution. Altern atively, one could e ncode the prior 



200 ih for p{d\g). We 



as p{d, g) = p{d\g)p(g) and then use a reference prior (jBerger et al. 

chose the more deliberate, independent, specification in order to encode our prior belief that 

there are essentially two kinds of processes: wavy (small d) and smooth (large d). Observe 
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that (i = is "closer" to the wavy parameterizations — in fact it hes at the hmit of extreme 
waviness. We argue, however, that extreme smoothness may not only be a more intuitive de- 
piction of linearity, but that large d-vaAnes (though further away) serve as a better platform 
from which to jump to the LLM. 

Evaluation of the marginalized posterior (jlj) requires settings for the prior mean coeffi- 
cients /3q, covariance r^W, and hierarchical specifications {a^, 7o-) for a^. For now, these 
parameter settings are fixed to those which were known to generate the data. 




Figure 5: Top row shows the GP{d,g) fits; Middle row shows likelihoods and bottom row 
shows the integrated posterior distribution for range {d, x-axis) and nugget {g, lines) settings 
for three samples, one per each column. Note that s2 = o"^ in the top row legend(s). 



Figure [5]shows three samples from the linear model ([6]) along with likelihood and posterior 
surfaces. Occasionally the likelihood and posterior lines suddenly stop due to a numerically 
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unstable parameterization (JNeall . 119971 ). The GP fits shown in the top row of the figure are 
based on the maximum a posteriori (MAP) estimates of d, g, and a^. The posteriors in 
the bottom row clearly show the influence of the prior. Nevertheless, the posterior density 
for large cZ-values is disproportionately high relative to the prior. Large (i-values represent 
at least 90% of the cumulative posterior distribution. Samples from these posteriors would 
yield mostly linear predictive surfaces. The last sample is particularly interesting as well as 
being the most representative. Here, the LLM {d = 0) is the MAP GP parameterization, and 
uniformly dominates all other parameterizations in posterior density. Still, the cumulative 
posterior density favors large d-values, thus favoring linear "looking" predictive surfaces over 
the actual (limiting) linear parameterization. 

Figure [6] {top left) shows a representative MAP GP fit for a sample of size n = 100 from 
(j6]). Since larger samples have a lower probability of coming out wavy, the likelihood of the 
LLM (on the right) is much higher than other GP parameterizations, though it is severely 
peaked. Small, nonzero, (i-values have extremely low likelihood. By contrast, the posterior 
in the bottom panel puts high density on large d- values. The MAP predictive surface {top 
left) has a very small, but noticeable, amount of curvature. Ideally, linear looking predictive 
surfaces should not have to bear the computational burden implied by full-fledged GPs. 
But since the LLM {d = 0) is a point-mass (which is the only parameterization that actually 
gives an identity covariance matrix), it has zero probability under the posterior. It would 
never be sampled in an MCMC, even when it did happen to be the MAP estimate. 

3.1.3 GP posteriors and likelihoods on non-linear data 

For completeness. Figure [7| and shows fits, likelihoods, and posteriors on non-linear data. The 
first column of Figure [7] corresponds to a sample with quadratic mean, and each successive 
column corresponds to a sample which is increasingly wavy. Each sample is of size n = 50. 
The shape of the prior loses its influence as the data become more non-linear. Although in 

14 



100 samples: y=1+2x+e, e-N{0,0.25) 



likelihood : Linear & GP(d,nug) 




posterior : Linear & GP(d,nug) 




Figure 6: Top-left shows the GP{d,g) fit with a sample of size n = 100; top-right shows the 
likehhood and bottom-right shows the integrated posterior for range {d, x-axis) and nugget 
{g, hues) settings. 



all three cases the MLEs do not correspond to the MAP estimates, the corresponding ML 
and MAP predictive surfaces look remarkably similar (not shown). This is probably due to 
the fact that the posterior integrates out f3 and a^, whereas the likelihoods were computed 
with point estimates of these parameters. 



4 Model selection prior 



With the ideas outlined above, we set out to construct a prior for the "mixture" of the 
GP with its LLM by focusing on large range parameters rather than d = or (7 ^ oo. 
The key idea is an augmentation of the parameter space by mx latent indicator variables 
b = {b}^^ G {0, 1}*"-^. The boolean 6j is intended to select either the GP (6j = 1) or its 
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50 samples: v=3x''2*b, e~N(D.0.25) 



SO samples: v=-2Q(x-D.5)'^3+2x-2+e, e~N(0,0.2S) 



50 samples: y=5l[1(pl'li/5) * 0.2*COS(4p|-x/5), e~N(0,D.18) 




likelihood : Linear & GP(d,nug) 




poslerror : Linear & GP(d,ni 



d - 0.090909 
g - 0.088889 
s2 - 0.64505 




likelihood : Linear & GP(d.ni 




posterior : Linear & GP(d,nug) 




likelihood : Linear & GP(d,nug) 




posterior : Linear & GP(d.n 




Figure 7: Top row shows the GP {d,g) fits; Middle row shows hkehhoods and bottom row 
shows the integrated posterior distribution for range {d, x-axis) and nugget {g, lines) settings 
for four samples, one per each column .As the samples become less linear the d-axis (x-axis) 
shrinks in order to focus in on the mode. 



LLM for the z* dimension. The actual range parameter used by the correlation function 
is multiplied by b: e.g. K*{-, ■|bd)G. To encode our preference that GPs with larger range 
parameters should be more likely to "jump" to the LLM, the prior on bi is specified as a 
function of the range parameter df. p{bi, di) = p{bi\di)p{di) . 

Probability mass functions which increase as a function of di, e.g.. 



P^,e„9,ih = 0\di) = 9, + {02 - 9i)/{l + exp{-7(rf, - 0.5)}) 



^i.e., component-wise multiplication — like the "b.*d" operation in Matlab 



(8) 
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p(d) = G(1,20)+G(10,10) and p(b|d) 



p(b) = 1 
p(b|cl) 




0.0 



0.5 1.0 1.5 2.0 

d 



Figure 8: Histogram of the mixture of gammas prior p{d) as given in Eq. ([7]), with the 
prior distribution for the boolean (6) superimposed on p{d) from Eq.® using (7,^1,^2) = 
(10,0.2,0.95) . 

with < 7 and < ^1 < 6*2 < 1, can encode such a preference by caUing for the exclusion of 
dimensions i with with large di when constructing K. Thus bi determines whether the GP 
or the LLM is in charge of the marginal process in the i* dimension. Accordingly, 6*1 and 
62 represent minimum and maximum probabilities of jumping to the LLM, while 7 governs 
the rate at which p{bi = 0\di) grows to 62 as di increases. Figure [8] plots p{bi = 0\di) for 
(7, 9i, 62) = (10, 0.2, 0.95) superimposed on the mixture of gammas prior p{di) from ([7]). The 
62 parameter is taken to be strictly less than one so as not to preclude a GP which models 
a genuinely nonlinear surface using an uncommonly large range setting. 
The implied prior probability of the full mjc-dimensional LLM is 



mx 



mx 



p(linear model) = 1 I p(&i = 0|(ij) = 1 I 



i=l 



i=l 



^1 + 



62 — 9i 



l + exp{-7((ii-0.5)}J 



(9) 



Observe that the resulting process is still a GP if any of the booleans bi are one. The 
primary computational advantage associated with the LLM is foregone unless all of the fej's 
are zero. However, the intermediate result offers an improvement in numerical stability in 
addition to describing a unique transitionary model lying somewhere between the GP and the 
LLM. Specifically, it allows for the implementation of semiparametric stochastic processes like 
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Z{x.) = /3/(x)+£:(x), representing a piecemeal spatial extension of a simple linear model. The 
first part (/3/(x)) of the process is linear in some known function of the full set of covariates 
X = {xi}^^, and e(-) is a spatial random process (e.g., a GP) which acts on a subset o 



the c ovariates x. Such models are commonplace in the statistics community fJDey et al 



19981 ). Traditionally, x is determined and fixed a priori. The separable boolean prior in ([8]) 
implements an adaptively semiparametric process where the subset x = {xj : bi = l,i = 
1, . . . , mx} is given a prior distribution, instead of being fixed. 

So even if the computational advantage of the LLM is not present because some bi ^ Owe 
still have that any bi = setting releases us from the burden of sampling the corresponding di. 
It also imparts on us the knowledge that the response has (at best) a linear relationship with 
the 2*'' covariate. This approach may also increase the scope for analysis of higher dimensional 
datasets, where data sparseness in higher dimensions (the "curse of dimensionality" ) can be 
ameliorated by using linear models in most dimensions and GP models only in the dimensions 
where they will have the most effect, thus reducing the dimension of the GP model space. 

Note that if an isotropic correlation function is used, which has only a single range 
parameter, then only one boolean b is needed, and the product can be dropped from ([9]). 
In this case, however, the advantage of being able to detect linearity, marginally, in each 
dimension is lost. 

4.1 Prediction 

Prediction under the LLM parameterization of the GP model ([5]) can be simplified when 
all of the booleans are zero, whence it is known that K = (1 + g)I. A characteristic of 
the standard linear model is that all input configurations (x) are treated as independent 
conditional on knowing f3. This additionally implies that in ([5]) the terms k{x.) and /^(x, Xj) 
are zero for all x. Thus, the predicted value of y at x is normally distributed with mean 



18 



y(x) = i~^ {x.)(3 and variance 



-2fT 



Tn-1i 



a'[l + r'f ' (x)Wf (x) - r'{ ' (x)WF ' ((1 + g)l + r'FWF ' )-^FWf (x 



It is helpful to re-write the above expression for the variance as 



a(x)' = (T'[l + r'f'(x)Wf(x)] 



a 



1 + 9 



1 + g 



(10) 



2 / 2 \ ^1 

1 + r^rfxlWf fx) - ^^f^WWF^ f I ' ^ -c^^^^-c^T \ T:.,,rf /„X_2 



FWF' FWffxlr' 



A matrix inversion lemma called the Woodbury 



formula (IGolub and Van Loanl . ll996l . pp. 51) 



20051 . pp. 67) states that for (I + 



or the Sherman-Morrison- Woodbury formula (iBernsteinl . 

V^AV) non-singular (A-^ + VY^)-^ = A- (AV)(I + V^AV)-ivTA. Taking V = F^(l 

g)~2 and A = r^W in flTOJ) gives 



aU? = a^ 



f^( 



W 



-1 



r^ 



F^F 



-1 



ffx) 



'111 



Eq. ( TTT1) is not only a simplification of the predictive variance given in ([5]), but it should look 
familiar. Writing Vs with K~^ = 1/(1 + (7) in (l3j) gives 



V; 



w 



r^ 



F^F 

'1 + 9 



and then: ^(x)^ = a^ [l + f^(x)V^f (x)] . (12) 



This is just the usual posterior predictive density at x under the standard linear model: 
y(x) ~ N[t~^ (x.)/3, cr^(l + f"'^(x)V3f (x))]. This means that we have a choice when it comes 
to obtaining samples from the posterior predictive distribution under the LLM. We prefer 
(TT2|) over ([5]) because the latter involves inverting the n x n matrix I + r^FWF~''/(l + g), 
whereas the former only requires the inversion of an m x ??2 matrix. Of course, if any of the 
booleans are non-zero, then GP prediction must proceed as usual, following Eq. ([5]). 
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GP fits to typical nonlinear response surfaces may seldom achieve bi = for all i. How- 
ever, the following section illustrates how treed partitioning can dramatically increase the 
probability of jumping to the LLM parameterization (in at least part of the input space) 
where the more thrifty predictive equations flT2l) may be exploited. 

5 Implementation, results, and comparisons 

Here, the GP with jumps to the LLM (hereafter GP LLM) is illustrated on synthetic and 
real data. This wor k grew out of research fo cused on extending the reach of the treed GP 



model presented by iGramacy and Led (J2008l ). whereby the data are recursively partitioned 
and a separate GP is fit in each partition. Thus most of our experiments are in this context, 
though in Section 15.31 we demonstrate an example without treed partitioning. Partition 
models are an ideal setting for evaluating the utility of the GP LLM, as linearity can be 
extracted in large areas of the input space. The result is a uniquely tractable nonstationary 
semiparametric spatial model. 

A separable correlation function is used throughout this section for brevity and consis- 
tency, even though in some cases the process which generated the data is clearly isotropic. 
Proposals for the booleans b are drawn from the prior, conditional on d, and accepted and 
rejected on the basis of the constructed covariance matrix K. The same prior parameteriza- 
tions are used for all experiments unless otherwise noted, the idea being to develop a method 
that works "right out of the box" as much as possible. 

5.1 Synthetic exponential data 

Consider the 2-d input space [—2, 6] x [—2, 6] in which the true response is given by 5^(x) = 
Xiexp(— x^ — Xg) + e, where e ~ A^(0,cr = 0.001). Figure [9] summarizes the consequences 
of estimation and prediction with the treed GP LLM for a n = 200 random sub-sample of 
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simple exponential data 



exp: areas under limiting iinear model 




portion of area in domain (x) 



Figure 9: Left: exponential data GP LLM fit. Right: histogram of the areas under the 
LLM. 



this data from a regular grid of size 441. The partitioning structure of the treed GP LLM 
first splits the region into two halves, one of which can be fit linearly. It then recursively 
partitions the half with the action into a piece which requires a GP and another piece which 
is also linear. The left pane shows a mean predictive surface wherein the LLM was used over 
66% of the domain (on average) which was obtained in less than ten seconds on a L8 GHz 
Athalon. The right pane shows a histogram of the areas of the domain under the LLM over 
20-fold repeated experiments. The four modes of the histogram clump around 0%, 25%, 
50%, and 75% showing that most often the obvious three-quarters of the space is under the 
LLM, although sometimes one of the two partitions will use a very smooth GP. The treed 
GP LLM was 40% faster than the treed GP alone when combining estimation and sampling 
from the posterior predictive distributions at the remaining n' = 241 points from the grid. 



5.2 Motorcycle Data 



The Motorcycle Accident Dataset (j Silverman 



19851 ) is a classic for illustrating nonstationary 



models. It samples the acceleration force on the head of a motorcycle rider as a function 
of time in the first moments after an impact. Figure [TUJ shows the data and a fit using the 
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treed GP LLM. The plot shows the mean predictive surface with 90% quantile error bars, 
along with a typical partition. On average, 29% of the domain was under the LLM, split 
between the left low-noise region (before impact) and the noisier rightmost region. 



Estimated Surface 




Figure 10: Motorcycle Data fit by treed GP LLM. 



Rasmussen and Ghahramanil (120021 ) analyzed this data using a Dirichlet process mixture 
of Gaussian process (DPGP) experts which reportedly took one hour on a 1 GHz Pentium. 
Such times are typical of inference under nonstationary models because of the computational 
effort required to construct and invert large covariance matrices. In contrast, the treed GP 
LLM fits this dataset with comparable accuracy but in less than one minute on a 1.8 GHz 
Athalon. 

We identify three things which make the treed GP LLM so fast relative to most nonsta- 
tionary spatial models. (1) Partitioning fits models to less data, yielding smaller matrices to 
invert. (2) Jumps to the LLM mean fewer inversions all together. (3) MCMC mixes better 
because under the LLM the parameters d and g are out of the picture and all sampling can 
be performed via Gibbs steps. 
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5.3 Friedman data 

This Friedman dataset is the first on e of a suite that was used to illustrate MARS (Multi- 



variate Adaptive Regression Splines) ([Friedman 



199ll ). There are 10 covariates in the data 



(x = {xi,X2, . . . ,Xio}), but the function that describes the responses (Y), observed with 
standard Normal noise, 

E(F|x) = /x = 10 sin(7ra;ia;2) + 20(^3 - 0.5)^ + 10x4 + Sxg (13) 

depends only on {xi, . . . ,0:5}, thus combining nonlinear, linear, and irrelevant effects. We 
make cor nparisons on th is data to results provided for several other models in recent lit- 



erature. 



Chipman et al 



(I2OO2I ) used this data to compare their treed LM algorithm to 
four other methods of varying parameterization: linear regression, greedy tree, MARS, 
and neural networks. The statistic they use for comparison is root mean-squared error. 



RMSE = \/Yl^=iif^i ~ ^i)^ I'^i where Yi is the model-predicted response for input Xj. The 
x's are randomly distributed on the unit interval. RMSE's are gathered for fifty repeated 
simulations of size n = 100 from (IT3l) . Chipman et al. provide a nice collection of boxplots 
showing the results. However, they do not provide any numerical results, so we have ex- 
tracted some key numbers from their plots and refer the reader to their paper for the full 
results. 

We duplicated the experiment using both a stationary GP and our GP LLM. For this 
dataset, we use a single model, not a treed model, as the function is essentially stationary 
in the spatial statistical sense (so if we were to try to fit a treed GP, it would keep all of the 
data in a single partition). Linearizing boolean prior parameters (7,^1,6*2) = (10,0.2,0.9) 
were used, which gave the LLM a relatively low prior probability of 0.35, for large range 
parameters di. The RMSEs that we obtained for the stationary GP and the GP LLM are 
summarized in the table below. 
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Mill 1st Qu. Median Mean 3rd Qu. 



Max 



GP LLM 


0.4341 


0.5743 


0.6233 


0.6258 


0.6707 


0.7891 


GP 


0.8196 


0.8835 


0.9131 


0.9232 


0.9710 


1.0440 


Linear 


1.710 


2.165 


2.291 


2.325 


2.500 


2.794 



Results on the linear model are reported for calibration purposes, and can be seen to be 
essentially the same as those reported by Chipman et al. RMSEs for the GP LLM are 
on average significantly better than all of those reported for the above methods, with lower 
variance. For example, the best mean RMSE shown in the boxplot is ~ 0.9. That is 1.4 times 
higher than the worst one we obtained for GP LLM. Further comparison to the boxplots 
provided by Chipman et al. shows that the GP LLM is the clear winner. It is also clear that 
jumping to a linear model in the relevant dimensions provides a more stable fit that gives 
improved predictive performance relative to a full (stationary) GP. 

In fitting the model, the Markov Chain quickly keyed in on the fact that only the first 
three covariates contribute nonlinearly. After burn-in, the booleans b almost never devi- 
ated from (1, 1, 1, 0, 0, 0, 0, 0, 0, 0). From the following table summarizing the posterior for 
the linear regression coefficients (/3) we can see that the coefficients for x^ and X5 (be- 
tween double-bars) were estimated accurately, and that the model correctly determined that 
{xq, . . .Xio} were irrelevant (i.e. not included in the GP, and had /3's close to zero). 



/3 



5% Qu. 

Mean 

95% Qu. 



X4 



X5 



8.40 2.60 

9.75 4.59 

10.99 9.98 



xe 



X7 



Xg 



Xio 



-1.23 -0.89 

-0.190 0.049 

0.92 1.00 



-1.82 -0.60 - 0.91 

-0.612 0.326 0.066 

0.68 1.21 1.02 



For a final comparison we consider a Support Vector M achine (SVM) rn ethod (JDrucker et al. 



19961 ). We note that the 



19961 ) illustrated on this data and compared to Bagging (JBreiman . 
SVM method required cross-validation (CV) to set some of its parameters. In the compari- 
son, 100 randomized training sets of size n = 200 were used, and RMSEs were collected for a 
(single) test set of size n' = 1000. An average MSE of 0.67 is reported, showing the SVM to 
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be uniformly better the Bagging method with an MSE of 2.26. We repeated the experiment 
for the GP LLM (which requires no CV!), and obtained an average MSE of 0.293, which is 
2.28 times better than the SVM, and 7.71 times better than Bagging. 

5.4 Boston housing data 

A commonly used dataset for validating multivariate models is the Boston H ousing Data 



Harrison and Rubinfeld 



Chipman et al 



19781 ). which contains 506 responses over 13 covariates. 
(J2OO2I ) showed that their (Bayesian) treed LM gave lower RMSEs, on average, compared to 
a number of popular techniques (the same ones listed in the previous section). Here we em- 
ployed a treed GP LLM, which is a generalization of their treed LM, retaining the original 
treed LM as an accessible special case. Though computationally more intensive than the 
treed LM, the treed GP LLM gives impressive results. To mitigate some of the computa- 
tional demands, the LLM can be used to initialize the Markov Chain by breaking the larger 
dataset into smaller partitions. Before treed GP burn-in begins, the model is fit using only 
the faster (limiting) treed LM model. Once the treed partitioning has stabilized, this fit 
is taken as the starting value for a full MCMC exploration of the posterior for the treed 
GP LLM. This initialization process allows us to fit GPs on smaller segments of the data, 
reducing the size of matrices that need to be inverted and greatly reducing computation 
time. For the Boston Housing data we use (7,^1,^2) = (10,0.2,0.95), which gives the LLM 
a prior probability of 0.95^^ ~ 0.51, when the dj^ s are large. 



2002) consist of calculating 



Experiments in the Bayesian treed LM paper (1 Chipman et al. 
RMSEs via 10-fold CV. The data are randomly partitioned into 10 groups, iteratively trained 
on 9/10 of the data, and tested on the remaining 1/10. This is repeated for 20 random 
partitions, and boxplots are shown. Note that the logarithm of the response is used and that 
CV is only used to assess predictive error, not to tune parameters. Samples are gathered 
from the posterior predictive distribution of the treed LM for six parameterizations using 20 

25 



restarts of 4000 iterations. This seems excessive, but we followed suit for the treed GP LLM 
in order to obtain a fair comparison. Our "boxplot" for training and testing RMSEs are 
summarized in the table below. As before, linear regression (on the log responses) is used 
for calibration. 

Min 1st Qu. Median Mean 3rd Qu. Max 



train GP LLM 
Linear 



test GP LLM 
Linear 



0.0701 0.0716 0.0724 0.0728 0.0730 0.0818 
0.1868 0.1869 0.1869 0.1869 0.1869 0.1870 



0.1321 0.1327 0.1346 0.1346 0.1356 0.1389 
0.1926 0.1945 0.1950 0.1950 0.1953 0.1982 



Notice that the RMSEs for the linear model have extremely low variability. This is similar 
to the results provided by Chipman et al. and was a key factor in determining that our 
experiment was well-calibrated. Upon comparison of the above numbers with the boxplots 
in Chipman et al., it can readily be seen that the treed GP LLM is leaps and bounds better 
than the treed LM, and all of the other methods in the study. Our worst training RMSE is 
almost two times lower than the best ones from the boxplot. All of our testing RMSEs are 
lower than the lowest ones from the boxplot, and our median RMSE (0.1346) is 1.26 times 
lower than the low ^est median RMSE (^ 0.17) from the boxplot. 



More recently. 



Chu et al. 



(120041 ) performed a similar experiment (see Table V), but in- 
stead of 10-fold CV, they randomly partitioned the data 100 times into training/test sets 
of size 481/25 and reported average MSEs on the un-transformed responses. They compare 
their Bayesian SVM regression algorithm (BSVR) to other high-powered techniques like 
Ridge Regression, Relevance Vector Machine, GPs, etc., with and without ARD (automatic 
relevance determination — essentially, a separable covariance function). Repeating their ex- 
periment for the treed GP LLM gave an average MSE of 6.96 compared to that of 6.99 for the 
BSVR with ARD, making the two algorithms by far the best in the comparison. However, 
without ARD the MSE of BSVR was 12.34, 1.77 times higher than the treed GP LLM, and 
the worst in the comparison. The reported results for a GP with (8.32) and without (9.13) 
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ARD showed the same effect, but to a lesser degree. Perhaps not surprisingly, the average 
MSEs do not tell the whole story. The 1st, median, and 3rd quartile MSEs we obtained for 
the treed GP LLM were 3.72, 5.32 and 8.48 respectively, showing that its distribution had a 
heavy right-hand tail. We take this as an indication that several responses in the data are 
either misleading, noisy, or otherwise very hard to predict. 

6 Conclusions 

Gaussian processes are a flexible modeling tool which can be an overkill for many applica- 
tions. We have shown how its limiting linear model can be both useful and accessible in 
terms of Bayesian posterior estimation, and prediction. The benefits include speed, parsi- 
mony, and a relatively straightforward implementation of a semiparametric model. When 
combined with treed partitioning the GP LLM extends the treed LM, resulting in a uniquely 
nonstationary, tractable, and highly accurate regression tool. 

We have focused on the separable power family of correlation functions, but the method- 
ology is by no means restricted to this family. All that is required is that the relevant family 
have a range parameter (like d) that yields the limiting (scaled) identity covariance matrix 
characterizing the LLM (like d ^ 0). For example, allowing a power < pi ^ 2 in the 
power family of Eq. ([2]) is straightforward. The separable Matern family also has the desired 
property: 

i^(xj,Xfc|p,0,d) = n 2P-ir(p+l/2) ^^^^'^'~^'^^^^^'^^'^'''-^^^'^'~^'^^^^^'^' 

where K,p is a modified Bessel function of the second kind. Unlike the power family, the 
isotropic Matern family does not arise as a special case where di = d, for i = 1, . . . , mx, but, 
as mentioned in Section HI the LLM methodology remains similarly applicable when there 
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is only one range parameter. 

One obvious extension of this work is to allow a larger class of "simple" models for 
the mean function of the process, such as higher-order polynomials or interactions between 



linear terms. Choos ing among simpler models can be done straightforward. 



Bayesian framework (JGeorge and McCulloch 



1993 



Geweke 



1996 



Joseph et al 



y wit 



lin the 



20081). Such 



an extension would likely allow more frequent selection of the simpler model, reducing the 
need for the GP. However, the clear understanding of what limiting cases lead to jumping 
from a GP to a linear model, as well as how to construct a set of booleans and their priors, 
would be lost in moving beyond linear models. Over a local region, a quadratic function 
would be well-approximated by a GP with a range parameter that could be neither very 
small nor large. Thus the approach contained herein would need further thought to efficiently 
extend it. There is also the tradeoff of the extra computing resources needed to test a larger 
set of models, whereas linear models can be worked directly into the current model fitting 
(via the auxiliary booleans), and do not require separate model selection steps. 



We believe that a large cont ribution of the GP 



design of computer experiments (JGramacy and Lee 



LM will be in the domain of sequential 



20081 ) which was the inspiration for much 



of the work presented here. Empirical evidence suggests that many computer experiments 
are nearly linear. That is, either the response is linear in most of its input dimensions, or the 
process is entirely linear in a subset of the input domain. Supremely relevant, but largely 
ignored in this paper, is that the Bayesian treed GP LLM provides a full posterior predictive 
distribution (particularly a nonstationary and thus region-specific estimate of predictive 
variance) which can be used towards active learning in the input domain. Exploitation of 
these characteristics should lead to an efficient framework for the adaptive exploration of 
computer experiment parameter spaces. 
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